AICE1006 - Data Analytics¶
Lecture 7 - Data Plotting (Advanced)¶
Interactive data visualization with plotly
Zhiwu Huang
Lecturer (Assistant Professor)
Vision, Learning and Control (VLC) Research Group
School of Electronics and Computer Science (ECS)
University of Southampton
Office Hour: Wed 2PM-3PM, Please book in advance.
Zhiwu.Huang@soton.ac.uk
Credit: Marco Forgione, Researcher, USI-SUPSI
Plotly in a nutshell¶
Plotly is a modern plotting library for Python, R, MATLAB, Julia, etc.
For Python, the reference documentation is available at https://plotly.com/python/
Plotly vs matplotlib¶
You can build high-quality visualizations with good old matplotlib. However,
- A lot of low-level code is required
- The visualizations are generally static
Plotly is a modern and powerful alternative. It provides:
- Concise high-level syntax for common data visualization
- Tight integration with pandas
- Interactive plots
Other alternatives exist: for instance seaborn
- Also concise and high-level
- Also integrated with pandas
- Not interactive
Plotly Express¶
The plotly express sub-module of plotly provides a high-level API for common visualizations. It covers many everyday use cases.
import plotly.express as px
# import plotly # contains more advanced low-level functionalities for custom visualizations
Plotly express provides methods to load well-known datasets. Let us load the iris dataset
df_iris = px.data.iris() # several classic dataframes are included in plotly for demonstration purpose
df_iris.sample(5)
| | sepal_length | sepal_width | petal_length | petal_width | species | species_id |
|---|---|---|---|---|---|---|
| 131 | 7.9 | 3.8 | 6.4 | 2.0 | virginica | 3 |
| 76 | 6.8 | 2.8 | 4.8 | 1.4 | versicolor | 2 |
| 56 | 6.3 | 3.3 | 4.7 | 1.6 | versicolor | 2 |
| 21 | 5.1 | 3.7 | 1.5 | 0.4 | setosa | 1 |
| 88 | 5.6 | 3.0 | 4.1 | 1.3 | versicolor | 2 |
Scatterplot¶
A scatterplot is the most common visualization for 2 numeric variables
fig = px.scatter(df_iris, x="petal_width", y="petal_length", width=1600, height=800) # specify dataframe and columns for x/y
fig.update_layout(font_size=20);
fig.show()
- Syntax: px.scatter(df_iris, x="petal_width", y="petal_length", ...)
- Axes labels automatically set to the column names
- Interactive!
Scatterplot cont'd¶
The marker color is commonly used as another dimension of visual analysis
fig = px.scatter(df_iris, x="petal_width", y="petal_length", color="species", width=1600, height=800) # specify dataframe and columns for x/y
fig.update_layout(font_size=20);
fig.show()
- Implemented with color="species"
- Legend automatically added
Scatterplot cont'd¶
The marker size provides yet another dimension of visual analysis
fig = px.scatter(df_iris, x="petal_width", y="petal_length", color="species", size="petal_width", width=1600, height=800)
fig.update_layout(font_size=20);
fig.show()
- Implemented with size="petal_width"
Scatterplot cont'd¶
The interactive text displayed when hovering over a point may also be modified
fig = px.scatter(df_iris, x="petal_width", y="petal_length", color="species",
size="sepal_width", hover_data=["sepal_length"], width=1600, height=800)
fig.update_layout(font_size=20);
fig.show()
- Implemented with hover_data=["sepal_length"]
Scatterplot matrix¶
The scatterplot matrix is a useful visualization for several numeric variables. It is the collection of all possible combinations of scatterplots.
fig = px.scatter_matrix(df_iris, dimensions=["sepal_width", "sepal_length", "petal_width", "petal_length"],
color="species", width=1600, height=800)
fig.update_layout(font_size=20);
fig.show()
- Implemented with px.scatter_matrix(...)
- The variables to be analyzed correspond to the dimensions argument
Histograms & Box plots¶
Histograms & box plots may be used to represent the distribution of a single numerical variable
fig = px.histogram(df_iris, x="sepal_width", width=800, height=400)
fig.update_layout(font_size=20);
fig.show()
fig = px.box(df_iris, x="sepal_width", width=800, height=400)
fig.update_layout(font_size=20);
fig.show()
Multiple box plots¶
Multiple box plots may be constructed by specifying a categorical variable for y...
fig = px.box(df_iris, x="sepal_width", y="species", width=800, height=400); fig.update_layout(font_size=20); fig.show()
... or for color
fig = px.box(df_iris, x="sepal_width", color="species", width=800, height=400); fig.update_layout(font_size=20); fig.show()
Multiple box plots cont'd¶
Note: the role of x and y may be interchanged
fig = px.box(df_iris, y="sepal_width", x="species", width=800, height=400); fig.update_layout(font_size=20); fig.show()
fig = px.box(df_iris, y="sepal_width", color="species", width=800, height=400); fig.update_layout(font_size=20); fig.show()
Bar plot¶
Bar plots are commonly used to represent a numeric variable vs. a categorical one. Example: aggregated group statistics
df_iris_mean = df_iris.groupby("species", as_index=False).mean()
df_iris_mean
| | species | sepal_length | sepal_width | petal_length | petal_width | species_id |
|---|---|---|---|---|---|---|
| 0 | setosa | 5.006 | 3.418 | 1.464 | 0.244 | 1 |
| 1 | versicolor | 5.936 | 2.770 | 4.260 | 1.326 | 2 |
| 2 | virginica | 6.588 | 2.974 | 5.552 | 2.026 | 3 |
fig = px.bar(df_iris_mean, x="species", y="petal_length", title="Average petal_length, by species"); fig.update_layout(font_size=20); fig.show()
Bar plot¶
Another example where a bar plot looks nice: data for different years
import plotly.express as px
data_canada_it = px.data.gapminder().query("country == 'Canada' or country == 'Italy'")
fig = px.bar(data_canada_it, x='year', y='pop', color="country", width= 1600, height=800)
fig.update_layout(font_size=20); fig.show()
#data_canada_it
Bar plot¶
The same data with grouped bars: barmode="group" places the bars of the two countries side by side instead of stacking them
import plotly.express as px
data_canada_it = px.data.gapminder().query("country == 'Canada' or country == 'Italy'")
fig = px.bar(data_canada_it, x='year', y='pop', color="country", barmode="group", width= 1600, height=800)
fig.update_layout(font_size=20); fig.show()
The Italian population has been stable since the 1980s, while the Canadian population is still increasing.
Pie charts¶
Pie charts give an intuitive representation of percentages.
df = px.data.gapminder().query("year == 2007").query("continent == 'Europe'")
df.loc[df['pop'] < 5.e6, 'country'] = 'Other countries' # Represent only large countries
df.sample(3)
| | country | continent | year | lifeExp | pop | gdpPercap | iso_alpha | iso_num |
|---|---|---|---|---|---|---|---|---|
| 527 | Finland | Europe | 2007 | 79.313 | 5238460 | 33207.08440 | FIN | 246 |
| 683 | Hungary | Europe | 2007 | 73.338 | 9956108 | 18008.94444 | HUN | 348 |
| 779 | Italy | Europe | 2007 | 80.546 | 58147733 | 28569.71970 | ITA | 380 |
fig = px.pie(df, values='pop', names='country', title='Population of European continent', width= 1600, height=800)
fig.update_layout(font_size=20); fig.show()
Faceting¶
Faceting handles up to two categorical variables by repeating the same base plot across different rows/columns.
Back to the tips dataset:
df_tip = px.data.tips()
df_tip.head(5)
| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
- total_bill and tip are numeric quantities
- day and time are categorical variables with a natural order
- sex and smoker are categorical variables with no natural order
Simple scatterplot¶
How does the relation between tip and total_bill vary across the different days? We may use a scatterplot of tip vs total_bill, colored by day.
fig = px.scatter(df_tip, x="total_bill", y="tip", color="day", width= 1600, height=800)
fig.update_layout(font_size=20); fig.show()
The result is not very clear...
Faceted Scatterplots¶
Facet columns may be used instead: a separate plot is generated for each day
fig = px.scatter(df_tip, x="total_bill", y="tip", facet_col="day", category_orders={"day": ["Thur", "Fri", "Sat", "Sun"]}, width= 1800, height=500)
fig.update_layout(font_size=20); fig.show()
- facet_col="day": repeat the scatterplot for the different values of the categorical variable day on columns
- The category_orders dictionary specifies the order to be used for the categorical variables
Faceted Scatterplots cont'd¶
Using facet rows and columns we may handle 2 categorical variables
fig = px.scatter(df_tip, x="total_bill", y="tip", facet_col="day", facet_row="time",
category_orders={"day": ["Thur", "Fri", "Sat", "Sun"], "time": ["Lunch", "Dinner"]},
width= 1600, height=700)
fig.update_layout(font_size=20); fig.show()
- facet_col="day": day on columns
- facet_row="time": time on rows
Faceted Histograms¶
Histograms may also be modified with faceting
fig = px.histogram(df_tip, x="total_bill", facet_col="day", facet_row="smoker", color="sex",
category_orders={"day": ["Thur", "Fri", "Sat", "Sun"]},
width= 1600, height=800, )
fig.update_layout(font_size=20); fig.show()
- 1 categorical variable (sex) handled with the color option
- 2 categorical variables (day/smoker) handled with rows/columns
Faceted Boxplot¶
fig = px.box(df_tip, x="day", y="total_bill",
facet_col="smoker",
category_orders={"day": ["Thur", "Fri", "Sat", "Sun"], "time": ["Lunch", "Dinner"]},
color="day",
width= 1600, height=800)
fig.update_layout(font_size=20); fig.show()
Animation: time as an extra dimension¶
In the following scatterplot, we visualize 4 properties for different countries in 2007:
- gdpPercap (x position)
- lifeExp (y position)
- continent (marker color)
- population (marker size)
import plotly.express as px
df = px.data.gapminder()
fig = px.scatter(df.query("year==2007"), x="gdpPercap", y="lifeExp", size="pop", color="continent", hover_name="country", log_x=True,
title="GDP, life expectancy, continent, and population of countries in 2007", size_max=60, width=1400, height=600)
fig.update_layout(font_size=20); fig.show()
What if we want to see the evolution over time? An animation could be used!
Animation: time as an extra dimension¶
import plotly.express as px
df = px.data.gapminder()
fig = px.scatter(df, x="gdpPercap", y="lifeExp", animation_frame="year",
size="pop", color="continent", hover_name="country",
log_x=True, size_max=45, range_x=[100,100000], range_y=[25,90],
width=1400, height=600)
fig.update_layout(font_size=20); fig.show()
Animation time is well suited to representing the year dimension!
Maps¶
Maps are the natural representation of geographical data. A scatter map works like a scatterplot in which the point positions are given by longitude and latitude.
import pandas as pd
# covid-19 italian data downloaded from https://github.com/pcm-dpc/COVID-19/blob/master/dati-regioni/dpc-covid19-ita-regioni.csv on 27-08-2020
data_latest = pd.read_csv("dpc-covid19-ita-regioni.csv")
center = {"lat": 43.1, "lon": 12.3} # coordinates of central Italy (Perugia)
fig = px.scatter_mapbox(data_latest, lon="long", lat="lat",
center=center,
size="totale_casi", # total cases
hover_data= ["denominazione_regione"], # region name
zoom=4)
fig.update_traces(textposition='top center')
fig.update_layout(
width=800,
height=800,
title_text='Italian COVID-19 total cases, updated on 27-08-2020',
#center=center
)
fig.update_layout(mapbox_style="carto-darkmatter") # warning! some styles require an account
fig.show()
Maps¶
Maps can also be animated, like the other Plotly Express visualizations shown so far.
center = {"lat": 43.1, "lon": 12.3}
fig = px.scatter_mapbox(data_latest, lon="long", lat="lat", # longitude, latitude
center=center,
size="totale_casi", # total cases
hover_data= ["denominazione_regione"], # region name
animation_frame="data", # date
zoom=4)
fig.update_traces(textposition='top center')
fig.update_layout(
width=800,
height=800,
title_text='Cases-Regions',
)
fig.update_layout(mapbox_style="carto-darkmatter") # warning! some styles require an account
fig.show()